This IPython notebook illustrates how to down sample two large tables that are loaded in memory.


In [1]:
import py_entitymatching as em


/Users/pradap/miniconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Down sampling is typically done when the input tables are large (e.g., each containing more than 100K tuples). For the purposes of this notebook, we will use two large datasets: Citeseer and DBLP. You can download the Citeseer dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv and the DBLP dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv. Once downloaded, save these datasets as 'citeseer.csv' and 'dblp.csv' in the current directory.
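If you would rather fetch the files programmatically, here is a minimal sketch (assuming the URLs above are still reachable) that downloads them with Python's standard library:

import urllib.request

# Download the two datasets into the current directory
urllib.request.urlretrieve('http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv', './citeseer.csv')
urllib.request.urlretrieve('http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv', './dblp.csv')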


In [5]:
# Read the CSV files
A = em.read_csv_metadata('./citeseer.csv', low_memory=False) # setting low_memory to False to speed up loading
B = em.read_csv_metadata('./dblp.csv', low_memory=False)

In [6]:
len(A), len(B)


Out[6]:
(1823978, 2512927)

In [7]:
A.head()


Out[7]:
  | id | title | authors | journal | month | year | publication_type
0 | 1 | An Arithmetic Analogue of Bezouts Theorem | David Mckinnon | NaN | NaN | NaN | NaN
1 | 2 | Thompsons Group F is Not Minimally Almost Convex | James Belk, Kai-uwe Bux | NaN | NaN | 2002.0 | NaN
2 | 3 | Cognitive Dimensions Tradeoffs in Tangible User Interface Design | Darren Edge, Alan Blackwell | NaN | NaN | NaN | NaN
3 | 4 | ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... | J. Bohnemeyer, Max Planck, I. Introduction | NaN | NaN | 2002.0 | NaN
4 | 5 | PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND | J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz | NaN | NaN | NaN | NaN

In [8]:
B.head()


Out[8]:
  | id | title | authors | journal | month | year | publication_type
0 | 1 | Klaus Tschira Stiftung gemeinnützige GmbH, KTS | Klaus Tschira | NaN | NaN | 2012 | www
1 | 2 | The SGML/XML Web Page | Robin Cover | NaN | NaN | 2006 | www
2 | 3 | The Future of Classic Data Administration: Objects + Databases + CASE | Arnon Rosenthal | NaN | NaN | 1998 | www
3 | 4 | XML Query Data Model | Mary F. Fernandez, Jonathan Robie | NaN | NaN | 2001 | www
4 | 5 | The XML Query Algebra | Peter Fankhauser, Mary F. Fernández, Ashok Malhotra, Michael Rys, Jérôme Siméon, Philip Wadler | NaN | NaN | 2001 | www

In [9]:
# Set 'id' as the key of each input table
em.set_key(A, 'id')
em.set_key(B, 'id')


Out[9]:
True

In [10]:
# Display the keys
em.get_key(A), em.get_key(B)


Out[10]:
('id', 'id')

In [12]:
# Downsample the datasets 
sample_A, sample_B = em.down_sample(A, B, size=1000, y_param=1)

In the down_sample command, size specifies the number of tuples to sample from B (this will be the size of the sampled B table), and y_param specifies the number of matching tuples to pick from A for each tuple sampled from B.

In the above, we set size to 1000 and y_param to 1, meaning that 1000 tuples are sampled from B and, for each of them, one matching tuple is picked from A.
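For illustration, a hypothetical variation (not executed in this notebook): increasing y_param picks more matching tuples from A per sampled B tuple, which yields a larger sample_A.

# Pick up to two matching tuples from A for each of the 1000 tuples sampled from B
sample_A2, sample_B2 = em.down_sample(A, B, size=1000, y_param=2)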


In [13]:
# Display the lengths of sampled datasets
len(sample_A), len(sample_B)

Now, the input tables A and B (with 1.8M and 2.5M tuples) have been down sampled to the much smaller tables sample_A and sample_B: sample_B contains the 1000 sampled tuples, and sample_A contains at most 1000 tuples (one matching tuple per sampled tuple, with duplicates removed).
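If you plan to use the down sampled tables in later steps of the matching workflow, you can persist them along with their metadata. A minimal sketch (the output file names here are arbitrary):

# Write the sampled tables and their metadata (e.g., the key) to disk
em.to_csv_metadata(sample_A, './sample_citeseer.csv')
em.to_csv_metadata(sample_B, './sample_dblp.csv')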